AITopics | Data Cleaning

Revenue maximization via machine learning with noisy data

Neural Information Processing Systems | Apr-26-2026, 00:25:13 GMT

Increasingly, copious amounts of consumer data are used to learn high-revenue mechanisms via machine learning. Existing research on mechanism design via machine learning assumes that there is a distribution over the buyers' values for the items for sale and that the learning algorithm's input is a training set sampled from this distribution. This setup makes the strong assumption that no noise is introduced during data collection. In order to help place mechanism design via machine learning on firm foundations, we investigate the extent to which this learning process is robust to noise. Optimizing revenue using noisy data is challenging because revenue functions are extremely volatile: an infinitesimal change in the buyers' values can cause a steep drop in revenue. Nonetheless, we provide guarantees when arbitrarily correlated noise is added to the training set; we only require that the noise has bounded magnitude or is sub-Gaussian. We conclude with an application of our guarantees to multi-task mechanism design, where there are multiple distributions over buyers' values and the goal is to learn a high-revenue mechanism per distribution. To our knowledge, we are the first to study mechanism design via machine learning with noisy data as well as multi-task mechanism design.
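The volatility described in the abstract can be seen in a minimal sketch, assuming the simplest possible mechanism: a single posted price offered to every buyer. The function name `posted_price_revenue` and the toy values are hypothetical and are not taken from the paper.

```python
def posted_price_revenue(price, values):
    """Revenue from offering one item at a fixed price to each buyer.

    A buyer purchases only if their value is at least the price.
    """
    return sum(price for v in values if v >= price)


# Hypothetical training set of buyer values.
values = [1.0, 2.0, 3.0]

# The same values after a tiny downward perturbation (bounded noise).
noisy = [v - 1e-9 for v in values]

price = 2.0  # the revenue-maximizing posted price on the clean sample

clean_revenue = posted_price_revenue(price, values)
noisy_revenue = posted_price_revenue(price, noisy)
print(clean_revenue, noisy_revenue)
```

On the clean sample two buyers accept the price; after the perturbation the buyer who valued the item at exactly the price drops out, so revenue at that price falls by half even though no value moved by more than 1e-9.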

artificial intelligence, data quality, machine learning, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.45)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.80)


SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

Neural Information Processing Systems | Feb-18-2026, 18:02:08 GMT

Our findings show interesting trends, particularly pertaining to recent methods for data curation such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce the gap.

data quality, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)


a7ce9b6a4db012cdaac28dd48989a17d-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Feb-17-2026, 06:30:11 GMT

data mining, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Switzerland > Basel-City > Basel (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Dermatology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Nuclear Medicine (0.67)
Information Technology (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
(7 more...)


56d33021e640f5d64a611a71b5dc30a3-Paper.pdf

Neural Information Processing Systems | Feb-8-2026, 18:33:33 GMT

mechanism design, proceedings, revenue, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.47)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.41)


Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis

Neural Information Processing Systems | Feb-6-2026, 07:42:23 GMT

Linear dimensionality reduction methods are commonly used to extract low-dimensional structure from high-dimensional data. However, popular methods disregard temporal structure, rendering them prone to extracting noise rather than meaningful dynamics when applied to time series data. At the same time, many successful unsupervised learning methods for temporal, sequential and spatial data extract features which are predictive of their surrounding context. Combining these approaches, we introduce Dynamical Components Analysis (DCA), a linear dimensionality reduction method which discovers a subspace of high-dimensional time series data with maximal predictive information, defined as the mutual information between the past and future. We test DCA on synthetic examples and demonstrate its superior ability to extract dynamical structure compared to commonly used linear methods. We also apply DCA to several real-world datasets, showing that the dimensions extracted by DCA are more useful than those extracted by other methods for predicting future states and decoding auxiliary variables.
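The predictive information named in the abstract, the mutual information between past and future, has a closed form if the past and future windows are assumed jointly Gaussian. That assumption is made here only for illustration; the abstract does not say how the quantity is estimated, and the symbols below are not taken from the paper.

```latex
% Mutual information between jointly Gaussian past and future windows,
% with covariance blocks \Sigma_{\text{past}}, \Sigma_{\text{future}}
% and joint covariance \Sigma_{\text{joint}}:
I(X_{\text{past}}; X_{\text{future}})
  = \frac{1}{2} \log
    \frac{\lvert \Sigma_{\text{past}} \rvert \, \lvert \Sigma_{\text{future}} \rvert}
         {\lvert \Sigma_{\text{joint}} \rvert}

% Special case: a one-dimensional projection and one time step,
% with lag-one correlation \rho between x_t and x_{t+1}:
I(x_t; x_{t+1}) = -\frac{1}{2} \log\left(1 - \rho^2\right)

% \rho = 0.9 gives about 0.83 nats; \rho = 0.1 gives about 0.005 nats.
```

Under this assumption, a projection that captures slowly varying dynamics (high lag-one correlation) scores far higher than one dominated by temporally uncorrelated noise, which is the contrast the abstract draws with methods that ignore temporal structure.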

artificial intelligence, data quality, machine learning, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.43)


Canonical correlation regression with noisy data

Meza, Isaac, Singh, Rahul

arXiv.org Machine Learning | Dec-30-2025

We study instrumental variable regression in data rich environments. The goal is to estimate a linear model from many noisy covariates and many noisy instruments. Our key assumption is that true covariates and true instruments are repetitive, though possibly different in nature; they each reflect a few underlying factors, however those underlying factors may be misaligned. We analyze a family of estimators based on two stage least squares with spectral regularization: canonical correlations between covariates and instruments are learned in the first stage, which are used as regressors in the second stage. As a theoretical contribution, we derive upper and lower bounds on estimation error, proving optimality of the method with noisy data. As a practical contribution, we provide guidance on which types of spectral regularization to use in different regimes.
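The two-stage structure the abstract builds on can be sketched with plain two-stage least squares for one covariate and one instrument. This is a simplification under stated assumptions: it omits the spectral regularization and the many-noisy-covariates setting that the paper studies, and the function names and toy data are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def slope(x, y):
    """Least-squares slope of y on x, with an intercept."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


def two_stage_least_squares(z, x, y):
    # Stage 1: project the covariate onto the instrument.
    g = slope(z, x)
    mz, mx = mean(z), mean(x)
    x_hat = [mx + g * (zi - mz) for zi in z]
    # Stage 2: regress the outcome on the fitted covariate.
    return slope(x_hat, y)


# Toy data (hypothetical): u is an unobserved confounder that is
# uncorrelated with the instrument z in this sample.
z = [-1.0, 1.0, -1.0, 1.0]
u = [-1.0, -1.0, 1.0, 1.0]
x = [zi + ui for zi, ui in zip(z, u)]
y = [2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]  # true coefficient is 2

beta_ols = slope(x, y)                        # biased by the confounder
beta_2sls = two_stage_least_squares(z, x, y)  # uses only variation from z
print(beta_ols, beta_2sls)
```

In this toy sample, ordinary least squares absorbs the confounder and overstates the coefficient, while the two-stage estimate recovers the value used to generate the data.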

artificial intelligence, data quality, machine learning, (16 more...)

arXiv.org Machine Learning

2512.22697

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.61)


Error Correction Code Transformer

Neural Information Processing Systems | Dec-25-2025, 19:13:46 GMT

Error correction code is a major part of the physical communication layer, ensuring the reliable transfer of data over noisy channels. Recently, neural decoders were shown to outperform classical decoding techniques. However, the existing neural approaches present strong overfitting, due to the exponential training complexity, or a restrictive inductive bias, due to reliance on Belief Propagation. Recently, Transformers have become methods of choice in many applications, thanks to their ability to represent complex interactions between elements. In this work, we propose to extend for the first time the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths. We encode each channel's output dimension to a high dimension for a better representation of the bits' information to be processed separately. The element-wise processing allows the analysis of channel output reliability, while the algebraic code and the interaction between the bits are inserted into the model via an adapted masked self-attention module. The proposed approach demonstrates the power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins, at a fraction of their time complexity.

electronic proceedings, error correction code transformer, name change, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.67)


Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Pareek, Divyansh, Oh, Sewoong, Du, Simon S.

arXiv.org Machine Learning | Dec-17-2025

The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta \in (0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: (i) the error without filtering is upper and lower bounded by $\frac{1}{\eta\sqrt{n}}$, and (ii) the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
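To make the size of the stated gain concrete, the rates quoted in the abstract can be divided against each other. This is a rough comparison only: it ignores constants and any logarithmic factors, and where the boundary between the large and small $\eta$ regimes lies is not given in the abstract.

```latex
% Large-\eta regime: unfiltered rate over filtered rate
\frac{1/(\eta\sqrt{n})}{1/\sqrt{\eta n}}
  = \frac{\sqrt{\eta n}}{\eta\sqrt{n}}
  = \frac{1}{\sqrt{\eta}}

% Small-\eta regime: unfiltered rate over filtered rate
\frac{1/(\eta\sqrt{n})}{1/\sqrt{n}} = \frac{1}{\eta}

% Illustration with assumed values n = 10^4 and \eta = 0.25,
% supposing this \eta falls in the large-\eta regime:
%   unfiltered: 1 / (0.25 \cdot 100) = 0.04
%   filtered:   1 / \sqrt{2500}      = 0.02
```

Read this way, the rates suggest that filtering improves the error bound by a factor of $1/\sqrt{\eta}$ when most pairs are correctly matched and by a factor of $1/\eta$ when few are.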

contrastive learning, denote, matrix, (16 more...)

arXiv.org Machine Learning

2512.1423

Country:

North America > United States > Washington > King County > Seattle (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Education (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.45)
Information Technology > Data Science > Data Quality > Data Cleaning (0.34)


OpenGVL -- Benchmarking Visual Temporal Progress for Data Curation

Budzianowski, Paweł, Wiśnios, Emilia, Góral, Gracjan, Kulakov, Igor, Petrenko, Viktor, Walas, Krzysztof

arXiv.org Artificial Intelligence | Dec-3-2025

Data scarcity remains one of the most limiting factors in driving progress in robotics. However, the amount of available robotics data in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task completion prediction could help automatically annotate and curate this data at scale. The Generative Value Learning (GVL) approach was recently proposed, leveraging the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building upon GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models, showing that open-source model families significantly underperform closed-source counterparts, achieving only approximately 70% of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at github.com/budzianowski/opengvl.

large language model, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

2509.17321

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.61)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

